Combining Structure and Content Similarities for XML Document Clustering
Authors
Abstract
This paper proposes a clustering approach that uses both the content and the structure of XML documents to determine the similarity among them. Assuming that the content and the structure of XML documents differ in role and importance depending on the use and purpose of a dataset, the content and structure information of the documents are handled using two different similarity measures. The similarity values produced by these two measures are then combined with weightings to give the overall document similarity. The effect of structure similarity and content similarity on the clustering solution is thoroughly analysed. The experiments show that, in a homogeneous environment (documents derived from one structural definition), clustering text-centric XML documents on content-only information produces a better solution; in a heterogeneous environment (documents derived from two or more structural definitions), however, clustering text-centric XML documents produces a better result when the structure and content similarities of the documents are combined with different strengths.
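The weighted combination described in the abstract can be illustrated with a minimal sketch. The abstract does not name the two similarity measures or the form of the weighting, so everything here is an assumption made for illustration: structure similarity is taken as the Jaccard overlap of root-to-node tag paths, content similarity as the cosine of raw term counts, and the two are mixed linearly by a hypothetical `structure_weight` parameter. It is not the paper's method.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(elem, prefix=""):
    """Collect the set of root-to-node tag paths as a simple structure summary."""
    path = f"{prefix}/{elem.tag}"
    paths = {path}
    for child in elem:
        paths |= tag_paths(child, path)
    return paths


def term_counts(elem):
    """Count lowercase, whitespace-separated tokens over all text in the document."""
    return Counter(" ".join(elem.itertext()).lower().split())


def structure_similarity(a, b):
    """Jaccard overlap of tag-path sets (an illustrative choice, not the paper's)."""
    pa, pb = tag_paths(a), tag_paths(b)
    return len(pa & pb) / len(pa | pb)


def content_similarity(a, b):
    """Cosine similarity of raw term counts (an illustrative choice, not the paper's)."""
    ca, cb = term_counts(a), term_counts(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def combined_similarity(xml_a, xml_b, structure_weight=0.5):
    """Linear mix: weight w on structure, 1 - w on content (assumed form)."""
    if not 0.0 <= structure_weight <= 1.0:
        raise ValueError("structure_weight must lie in [0, 1]")
    a, b = ET.fromstring(xml_a), ET.fromstring(xml_b)
    return (
        structure_weight * structure_similarity(a, b)
        + (1.0 - structure_weight) * content_similarity(a, b)
    )
```

Under these assumptions, setting `structure_weight=0.0` gives a content-only comparison, the setting the abstract reports as better for homogeneous collections, while intermediate values mix both signals as in the heterogeneous case.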
Similar resources
Metaheuristic Clustering of Persian XML Documents Based on Structural and Content Similarity
With the growing number of XML documents, organizing them effectively so that useful information can be retrieved is essential. One possible solution is to cluster the XML documents in order to discover knowledge. A key issue in clustering XML documents is how to measure the similarity between them. Conventional clustering of text documents using a do...
XML Document Probabilistic Clustering Based on Structure and Content
A large volume of information is stored in XML format on the Web, and clustering is one method of managing these documents. Most current methods for clustering XML documents consider only one of two aspects, structure or content. In this paper, we propose SCEM (Expectation Maximization Structure and Content) for XML documents, which is used to effectively cluster XML documents by combining content and structur...
Document Clustering Based on Ontology and a Fuzzy Approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised classification of similar documents into groups, is one of the important techniques of text mining. The most important step...
RI structurée, RI et XML, RI précise
In this paper we present a clustering method for XML documents. Our approach has two phases: we first automatically extract the structure from the document; we then use it as a representation model to classify the document it represents. The matching of the documents' structures is based on the calculation of their similarities. For the experiments we used INEX. KEYWORDS: Clusteri...
Apply Uncertainty in Document-Oriented Database (MongoDB) Using F-XML
As we move into the big data world, where unstructured data is growing at high velocity, there is a need for a data store that can hold this volume of data. Traditionally, relational databases have been used, but they are not suited to handling this amount of data, so a move to non-relational data stores is needed. In the current study, we have proposed an extension of the Mongo...
Publication date: 2008